{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "# 第18章 概率潜在语义分析"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "## 习题18.1\n",
    "&emsp;&emsp;证明生成模型与共现模型是等价的。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**解答：**  "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**解答思路：**\n",
    "\n",
    "1. 生成模型的定义\n",
    "2. 共现模型的定义\n",
    "3. 证明生成模型与共现模型是等价的"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**解答步骤：**  "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**第1步：生成模型**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "&emsp;&emsp;根据书中第18.1.2节的生成模型的定义：\n",
    "\n",
    "> &emsp;&emsp;假设有单词集合$W=\\{ w_1, w_2, \\cdots, w_M\\}$，其中$M$是单词个数；文本（指标）集合$D=\\{ d_1, d_2, \\cdots, d_N \\}$，其中$N$是文本个数；话题集合$Z=\\{z_1,z_2,\\cdots, z_K\\}$，其中$K$是预先设定的话题个数。随机变量$w$取值于单词集合；随机变量$d$取值于文本集合，随机变量$z$取值于话题集合。概率分布$P(d)$、条件概率分布$P(z|d)$、条件概率分布$P(w|z)$皆属于多项分布，其中$P(d)$表示生成文本$d$的概率，$P(z|d)$表示文本$d$生成话题$z$的概率，$P(w|z)$表示话题$z$生成单词$w$的概率。  \n",
    "> \n",
    "> &emsp;&emsp;每个单词-文本对$(w,d)$的生成概率由以下公式决定：\n",
    "> $$\n",
    "\\begin{aligned}\n",
    "P(w,d) \n",
    "&= P(d)P(w|d) \\\\\n",
    "&= P(d) \\sum_z P(w,z|d) \\\\\n",
    "&= P(d) \\sum_z P(z|d)P(w|z)\n",
    "\\end{aligned} \\tag{1}\n",
    "$$\n",
    "> 即生成模型的定义。  \n",
    "> &emsp;&emsp;生成模型假设在话题$z$给定条件下，单词$w$与文本$d$条件独立，即\n",
    "> $$\n",
    "P(w,z|d) = P(z|d) P(w|z)\n",
    "$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**第2步：共现模型**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "&emsp;&emsp;根据书中第18.1.3节的共现模型的定义：\n",
    "\n",
    "> &emsp;&emsp;每个单词-文本对$(w,d)$的概率由以下公式决定：  \n",
    "> $$\n",
    "P(w,d) = \\sum_{z \\in Z} P(z) P(w|z) P(d|z) \\tag{2}\n",
    "$$\n",
    "> 即共现模型的定义。  \n",
    "> &emsp;&emsp;共现模型假设在话题$z$给定条件下，单词$w$与文本$d$是条件独立的，即\n",
    "> $$\n",
    "P(w,d|z) = P(w|z) P(d|z)\n",
    "$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**第3步：证明生成模型与共现模型是等价的**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "&emsp;&emsp;结合公式(1)和公式(2)，可得：\n",
    "$$\n",
    "\\begin{aligned}\n",
    "P(w,d)\n",
    "&= P(d) \\sum_z P(z|d)P(w|z) \\\\\n",
    "&= \\sum_z P(w|z)P(z|d)P(d) \\\\\n",
    "&= \\sum_z P(w,z|d)P(d) \\\\\n",
    "&= \\sum_z P(w,d,z) \\\\\n",
    "&= \\sum_z P(z)P(w,d|z) \\\\\n",
    "&= \\sum_z P(z)P(w|z)P(d|z)\n",
    "\\end{aligned}\n",
    "$$\n",
    "&emsp;&emsp;可知生成模型与共现模型的概率公式是等价的，故证得生成模型与共现模型是等价的。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 习题18.2\n",
    "&emsp;&emsp;推导共现模型的EM算法。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**解答：**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**解答思路：**  \n",
    "\n",
    "1. EM算法的一般步骤\n",
    "2. 推导共现模型的$Q$函数\n",
    "3. 推导共现模型的E步\n",
    "4. 推导共现模型的M步\n",
    "5. 写出共现模型的EM算法"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**解答步骤：**   "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**第1步：EM算法**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "&emsp;&emsp;根据书中第18.2节的EM算法介绍：\n",
    "\n",
    "> &emsp;&emsp;EM算法是一种迭代算法，每次迭代包括交替的两步：E步，求期望；M步，求极大。E步是计算$Q$函数，即完全数据的对数似然函数对不完全数据的条件分布的期望。M步是对$Q$函数极大化，更新模型参数。详细介绍见第9章。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**第2步：推导共现模型的$Q$函数推导**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "&emsp;&emsp;设单词集合为$W=\\{ w_1, w_2, \\cdots, w_M\\}$，文本集合为$D=\\{ d_1, d_2, \\cdots, d_N \\}$，话题集合为$Z=\\{z_1,z_2,\\cdots, z_K\\}$。给定单词-文本共现数据$T=\\{ n(w_i, d_j)\\},i=1,2,\\cdots,M, \\ j=1,2,\\cdots,N$，目标是估计概率潜在语义模型（共现模型）的参数。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "&emsp;&emsp;根据书中第18.1.3节的共现模型定义：\n",
    "\n",
    "> &emsp;&emsp;文本-单词共现数据$T$的生成概率为所有单词-文本对$(w,d)$的生成概率的乘积：\n",
    "> \n",
    "> $$\n",
    "P(T) = \\prod_{(w,d)} P(w,d)^{n(w,d)} \\tag{1}\n",
    "$$\n",
    "> \n",
    "> 每个单词-文本对$(w,d)$的概率由以下公式决定：\n",
    "> \n",
    "> $$\n",
    "P(w, d) = \\sum_{z \\in Z} P(z)P(w|z)P(d|z) \\tag{2}\n",
    "$$\n",
    "> 即共现模型的定义。  "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "&emsp;&emsp;对公式（1）使用极大似然估计，结合公式（2），对数似然函数是\n",
    "$$\n",
    "\\begin{aligned}\n",
    "\\log P(T) &= \\sum_{i=1}^M \\sum_{j=1}^N n(w_i, d_j) \\log P(w_i, d_j) \\\\\n",
    "&= \\sum_{i=1}^M \\sum_{j=1}^N n(w_i, d_j) \\log \\left[ \\sum_{k=1}^K P(z_k) P(w_i | z_k) P(d_j | z_k) \\right]\n",
    "\\end{aligned}\n",
    "$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "&emsp;&emsp;由于$\\log$函数是上凸函数，可以基于`Jessen`不等式，得：\n",
    "\n",
    "$$\n",
    "\\begin{aligned}\n",
    "& \\sum_{i=1}^M \\sum_{j=1}^N n(w_i, d_j) \\log \\left[ \\sum_{k=1}^K P(z_k) P(w_i | z_k) P(d_j | z_k) \\right] \\\\\n",
    "\\geqslant &  \\sum_{i=1}^M \\sum_{j=1}^N n(w_i, d_j) \\sum_{k=1}^K \\log \\Big[ P(z_k) P(w_i | z_k) P(d_j | z_k) \\Big]\n",
    "\\end{aligned}\n",
    "$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "&emsp;&emsp;可以得到$Q$函数：\n",
    "\n",
    "$$\n",
    "\\begin{aligned}\n",
    "Q \n",
    "&= E_Z[\\log P(T) |z] \\\\\n",
    "&= E_Z \\left\\{ \\sum_{i=1}^M \\sum_{j=1}^N n(w_i, d_j) \\sum_{k=1}^K \\log \\Big[ P(z_k) P(w_i | z_k) P(d_j | z_k) \\Big] \\right \\} \\\\\n",
    "&= \\sum_{i=1}^M \\sum_{j=1}^N n(w_i, d_j) E_Z \\left\\{ \\sum_{k=1}^K \\log \\Big[ P(z_k) P(w_i | z_k) P(d_j | z_k) \\Big] \\right \\} \\\\\n",
    "&= \\sum_{i=1}^M \\sum_{j=1}^N n(w_i, d_j) \\sum_{k=1}^K P(z_k | w_i, d_j) \\log \\Big[ P(z_k) P(w_i | z_k) P(d_j | z_k) \\Big]\n",
    "\\end{aligned}\n",
    "$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**第3步：推导共现模型的E步**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "&emsp;&emsp;根据贝叶斯公式计算\n",
    "$$\n",
    "P(z_k | w_i, d_j) = \\frac{P(z_k) P(w_i | z_k) P(d_j | z_k)}{\\displaystyle \\sum_{k=1}^K P(z_k) P(w_i | z_k) P(d_j | z_k)}\n",
    "$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**第4步：推导共现模型的M步**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "&emsp;&emsp;通过约束最优化求解$Q$函数的极大值，这时$P(z_k)$、$P(w_i | z_k)$和$P(d_j | z_k)$是变量。因为变量$P(z_k)$、$P(w_i | z_k)$和$P(d_j | z_k)$形成概率分布，满足约束条件\n",
    "$$\n",
    "\\begin{array}{ll}\n",
    "\\displaystyle \\sum_{k=1}^K P(z_k) = 1 \\\\\n",
    "\\displaystyle \\sum_{i=1}^M P(w_i | z_k) = 1, \\quad k=1,2,\\cdots,K \\\\\n",
    "\\displaystyle \\sum_{j=1}^N P(d_j | z_k) = 1, \\quad k=1,2,\\cdots,K \n",
    "\\end{array}\n",
    "$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "使用拉格朗日法，引入拉格朗日乘子$\\alpha,\\beta,\\gamma$，定义拉格朗日函数$\\Lambda$\n",
    "$$\n",
    "\\Lambda = Q + \\alpha \\Big(1 - \\sum_{k=1}^K P(z_k) \\Big) + \\sum_{k=1}^K \\beta_k \\Big( 1- \\sum_{i=1}^M P(w_i | z_k) \\Big) + \\sum_{k=1}^K \\gamma_k \\Big( 1 - \\sum_{j=1}^N P(d_j | z_k) \\Big)\n",
    "$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "将拉格朗日函数$\\Lambda$分别对$P(z_k)$、$P(w_i | z_k)$和$P(d_j | z_k)$求偏导数，并令其等于0，得到下面的方程组\n",
    "$$\n",
    "\\begin{array}{lll}\n",
    "\\displaystyle \\sum_{i=1}^M \\sum_{j=1}^N n(w_i, d_j) P(z_k | w_i, d_j) - \\alpha P(z_k) = 0,& k = 1,2,\\cdots, K \\\\\n",
    "\\displaystyle \\sum_{j=1}^N n(w_i, d_j) P(z_k | w_i, d_j) - \\beta_k P(w_i | z_k) = 0, & i=1,2, \\cdots, M; & k = 1,2,\\cdots, K \\\\\n",
    "\\displaystyle \\sum_{i=1}^M n(w_i, d_j) P(z_k | w_i, d_j) - \\gamma_k P(d_j | z_k) = 0, & j=1,2, \\cdots, N; & k = 1,2,\\cdots, K\n",
    "\\end{array} \n",
    "$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "解方程组得M步的参数估计公式：\n",
    "$$\n",
    "\\begin{array}{ll}\n",
    "\\displaystyle p(z_k) = \\frac{\\displaystyle \\sum_{i=1}^M \\sum_{j=1}^N n(w_i, d_j) P(z_k | w_i, d_j)}{\\displaystyle \\sum_{j=1}^N n(d_j)} \\\\\n",
    "\\displaystyle p(w_i | z_k) = \\frac{\\displaystyle \\sum_{j=1}^N n(w_i, d_j) P(z_k | w_i, d_j)}{\\displaystyle \\sum_{m=1}^M \\sum_{j=1}^N n(w_m, d_j) P(z_k | w_m, d_j)} \\\\\n",
    "\\displaystyle p(d_j | z_k) = \\frac{\\displaystyle \\sum_{i=1}^M n(w_i, d_j) P(z_k | w_i, d_j)}{\\displaystyle \\sum_{i=1}^M \\sum_{l=1}^N n(w_i, d_l) P(z_k | w_i, d_l)}\n",
    "\\end{array}\n",
    "$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**第5步：共现模型的EM算法**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "输入：设单词集合为$W=\\{ w_1, w_2, \\cdots, w_M\\}$，文本集合为$D=\\{ d_1, d_2, \\cdots, d_N \\}$，话题集合为$Z=\\{z_1,z_2,\\cdots, z_K\\}$，共现数据$\\{ n(w_i, d_j)\\},i=1,2,\\cdots,M, \\ j=1,2,\\cdots,N$；  \n",
    "输出：$P(z_k)$、$P(w_i | z_k)$和$P(d_j | z_k)$。  \n",
    "（1）设置参数$P(z_k)$、$P(w_i | z_k)$和$P(d_j | z_k)$的初始值。  \n",
    "（2）迭代执行以下E步、M步，直到收敛为止。  \n",
    "&emsp;&emsp;&nbsp;&nbsp;E步：\n",
    "$$\n",
    "P(z_k | w_i, d_j) = \\frac{P(z_k) P(w_i | z_k) P(d_j | z_k)}{\\displaystyle \\sum_{k=1}^K P(z_k) P(w_i | z_k) P(d_j | z_k)}\n",
    "$$\n",
    "&emsp;&emsp;&nbsp;&nbsp;M步：\n",
    "$$\n",
    "\\begin{array}{ll}\n",
    "\\displaystyle p(z_k) = \\frac{\\displaystyle \\sum_{i=1}^M \\sum_{j=1}^N n(w_i, d_j) P(z_k | w_i, d_j)}{\\displaystyle \\sum_{j=1}^N n(d_j)} \\\\\n",
    "\\displaystyle p(w_i | z_k) = \\frac{\\displaystyle \\sum_{j=1}^N n(w_i, d_j) P(z_k | w_i, d_j)}{\\displaystyle \\sum_{i=1}^M \\sum_{j=1}^N n(w_i, d_j) P(z_k | w_i, d_j)} \\\\\n",
    "\\displaystyle p(d_j | z_k) = \\frac{\\displaystyle \\sum_{i=1}^M n(w_i, d_j) P(z_k | w_i, d_j)}{\\displaystyle \\sum_{i=1}^M \\sum_{j=1}^N n(w_i, d_j) P(z_k | w_i, d_j)}\n",
    "\\end{array}\n",
    "$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 习题18.3\n",
    "&emsp;&emsp;对以下文本数据集进行概率潜在语义分析。\n",
    "![18-3-Text-Dataset.png](./images/18-3-Text-Dataset.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**解答：**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**解答思路：**  \n",
    "\n",
    "1. 给出概率潜在语义模型参数估计的EM算法\n",
    "2. 自编程实现概率潜在语义模型参数估计的EM算法\n",
    "3. 对文本数据集进行概率潜在语义分析"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**解答步骤：**   "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**第1步：概率潜在语义模型参数估计的EM算法**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "&emsp;&emsp;根据书中第18章的算法18.1的概率潜在语义模型参数估计的EM算法：\n",
    "\n",
    "> **算法18.1（概率潜在语义模型参数估计的EM算法）**  \n",
    "> 输入：设单词集合为$W=\\{ w_1, w_2, \\cdots, w_M\\}$，文本集合为$D=\\{ d_1, d_2, \\cdots, d_N \\}$，话题集合为$Z=\\{z_1,z_2,\\cdots, z_K\\}$，共现数据$\\{ n(w_i, d_j)\\},i=1,2,\\cdots,M, \\ j=1,2,\\cdots,N$；  \n",
    "> 输出：$P(w_i | z_k)$和$P(z_k | d_j)$。  \n",
    "> （1）设置参数$P(w_i | z_k)$和$P(z_k | d_j)$的初始值。  \n",
    "> （2）迭代执行以下E步、M步，直到收敛为止。  \n",
    "> &emsp;&emsp;&nbsp;&nbsp;E步：\n",
    "> $$\n",
    "P(z_k | w_i, d_j) = \\frac{P(w_i | z_k) P(z_k | d_j)}{\\displaystyle \\sum_{k=1}^K P(w_i | z_k) P(z_k | d_j)}\n",
    "$$\n",
    "> &emsp;&emsp;&nbsp;&nbsp;M步：\n",
    "> $$\n",
    "\\begin{array}{ll}\n",
    "\\displaystyle p(w_i | z_k) = \\frac{\\displaystyle \\sum_{j=1}^N n(w_i, d_j) P(z_k | w_i, d_j)}{\\displaystyle \\sum_{m=1}^M \\sum_{j=1}^N n(w_m, d_j) P(z_k | w_m, d_j)} \\\\\n",
    "\\displaystyle p(d_j | z_k) = \\frac{\\displaystyle \\sum_{i=1}^M n(w_i, d_j) P(z_k | w_i, d_j)}{n(d_j)}\n",
    "\\end{array}\n",
    "$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**第2步：自编程实现概率潜在语义模型参数估计的EM算法**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "\n",
    "class EMPlsa:\n",
    "    def __init__(self, max_iter=100, random_state=2022):\n",
    "        \"\"\"\n",
    "        基于生成模型的EM算法的概率潜在语义模型\n",
    "        :param max_iter: 最大迭代次数\n",
    "        :param random_state: 随机种子\n",
    "        \"\"\"\n",
    "        self.max_iter = max_iter\n",
    "        self.random_state = random_state\n",
    "\n",
    "    def fit(self, X, K):\n",
    "        \"\"\"\n",
    "        :param X: 单词-文本矩阵\n",
    "        :param K: 话题个数\n",
    "        :return: P(w_i|z_k) 和 P(z_k|d_j)\n",
    "        \"\"\"\n",
    "        # M, N分别为单词个数和文本个数\n",
    "        M, N = X.shape\n",
    "\n",
    "        # 计算n(d_j)\n",
    "        n_d = [np.sum(X[:, j]) for j in range(N)]\n",
    "\n",
    "        # (1)设置参数P(w_i|z_k)和P(z_k|d_j)的初始值\n",
    "        np.random.seed(self.random_state)\n",
    "        p_wz = np.random.random((M, K))\n",
    "        p_zd = np.random.random((K, N))\n",
    "\n",
    "        # (2)迭代执行E步和M步，直至收敛为止\n",
    "        for _ in range(self.max_iter):\n",
    "            # E步\n",
    "            P = np.zeros((M, N, K))\n",
    "            for i in range(M):\n",
    "                for j in range(N):\n",
    "                    for k in range(K):\n",
    "                        P[i][j][k] = p_wz[i][k] * p_zd[k][j]\n",
    "                    P[i][j] /= np.sum(P[i][j])\n",
    "\n",
    "            # M步\n",
    "            for k in range(K):\n",
    "                for i in range(M):\n",
    "                    p_wz[i][k] = np.sum([X[i][j] * P[i][j][k] for j in range(N)])\n",
    "                p_wz[:, k] /= np.sum(p_wz[:, k])\n",
    "\n",
    "            for k in range(K):\n",
    "                for j in range(N):\n",
    "                    p_zd[k][j] = np.sum([X[i][j] * P[i][j][k] for i in range(M)]) / n_d[j]\n",
    "\n",
    "        return p_wz, p_zd"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**第3步：对文本数据集进行概率潜在语义分析**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "参数P(w_i|z_k)：\n",
      "[[0.    0.162 0.   ]\n",
      " [0.    0.    0.2  ]\n",
      " [0.116 0.081 0.   ]\n",
      " [0.115 0.    0.1  ]\n",
      " [0.    0.081 0.1  ]\n",
      " [0.423 0.271 0.2  ]\n",
      " [0.    0.162 0.   ]\n",
      " [0.115 0.    0.1  ]\n",
      " [0.    0.    0.3  ]\n",
      " [0.    0.243 0.   ]\n",
      " [0.231 0.    0.   ]]\n",
      "参数P(z_k|d_j)：\n",
      "[[0.    1.    0.    0.553 1.    0.    1.    0.    0.   ]\n",
      " [1.    0.    1.    0.447 0.    0.    0.    1.    0.   ]\n",
      " [0.    0.    0.    0.    0.    1.    0.    0.    1.   ]]\n"
     ]
    }
   ],
   "source": [
    "# 输入文本-单词矩阵，共有9个文本，11个单词\n",
    "X = np.array([[0, 0, 1, 1, 0, 0, 0, 0, 0],\n",
    "              [0, 0, 0, 0, 0, 1, 0, 0, 1],\n",
    "              [0, 1, 0, 0, 0, 0, 0, 1, 0],\n",
    "              [0, 0, 0, 0, 0, 0, 1, 0, 1],\n",
    "              [1, 0, 0, 0, 0, 1, 0, 0, 0],\n",
    "              [1, 1, 1, 1, 1, 1, 1, 1, 1],\n",
    "              [1, 0, 1, 0, 0, 0, 0, 0, 0],\n",
    "              [0, 0, 0, 0, 0, 0, 1, 0, 1],\n",
    "              [0, 0, 0, 0, 0, 2, 0, 0, 1],\n",
    "              [1, 0, 1, 0, 0, 0, 0, 1, 0],\n",
    "              [0, 0, 0, 1, 1, 0, 0, 0, 0]])\n",
    "\n",
    "# 设置精度为3\n",
    "np.set_printoptions(precision=3, suppress=True)\n",
    "\n",
    "# 假设话题的个数是3个\n",
    "k = 3\n",
    "\n",
    "em_plsa = EMPlsa(max_iter=100)\n",
    "\n",
    "p_wz, p_zd = em_plsa.fit(X, 3)\n",
    "\n",
    "print(\"参数P(w_i|z_k)：\")\n",
    "print(p_wz)\n",
    "print(\"参数P(z_k|d_j)：\")\n",
    "print(p_zd)"
   ]
  }
 ],
 "metadata": {
  "interpreter": {
   "hash": "c50c7698ff25eb1d7dce4c52e3d5cc14b3b23d6a97d4fb70aad8c815eba6f63e"
  },
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.5"
  },
  "toc": {
   "base_numbering": 1,
   "nav_menu": {},
   "number_sections": false,
   "sideBar": true,
   "skip_h1_title": false,
   "title_cell": "Table of Contents",
   "title_sidebar": "Contents",
   "toc_cell": false,
   "toc_position": {
    "height": "calc(100% - 180px)",
    "left": "10px",
    "top": "150px",
    "width": "273.188px"
   },
   "toc_section_display": true,
   "toc_window_display": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}
